{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# COMPSCI 389: Homework 5\n", "\n", "**Assigned**: April 24, 2024. **Due**: May 2, 2024 at 2:00pm Eastern. **Note**: Submissions received after 2:00pm Eastern on May 9, 2024 will receive no credit.\n", "\n", "**Submitting**: Upload your submission on Gradescope as a `.pdf`. Converting to a PDF can be a complicated process, and so we encourage you to test this process well in advance of the submission deadlines. We recommend converting to HTML, opening the HTML file in a browser, and then printing or exporting to a PDF from your browser. We do not recommend directly converting to a PDF, since this requires installing xelatex. To convert to HTML in VSCode, press `ctrl+shift+p` and type `export`, and you should see an option to export to HTML.\n", "\n", "**Note**: Keep your `.ipynb` file, as we may request it directly (via email).\n", "\n", "**Note**: When converting to a PDF file, ensure that all of your code cells have been executed. The results of these executions *must* be included in your submitted PDF." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Note: Different Instructions\n", "\n", "Unlike past assignments, this one does not come with solutions. You should complete this assignment on your own without using any materials other than those posted on the course webpage. However, you may reference online documentation for Python or Jupyter notebooks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overview\n", "\n", "In this assignment we will implement the MENACE algorithm and plot a **learning curve** that characterizes how it learns. Some code will be provided for you, but you will be asked to add some missing parts of the code. Specifically, you should add \"Step 9\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Import Statements\n", "\n", "Let's begin with the import statements that we will use. You should not use any additional import statements without prior approval from the instructor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "import time\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Implementing Tic-Tac-Toe (Naughts and Crosses)\n", "\n", "In this section we implement tic-tac-toe. This code has been provided, but you are encouraged to look through it so that you are familiar with the details.\n", "\n", "We will represent the board as a $3 \\times 3$ matrix with values 'X', 'O', or ' ' (empty). First, let's write a function to print the board." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def print_board(board):\n", " for row in board:\n", " print(\" | \".join(row))\n", " print(\"-\" * 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's write a function to check for a winner. This function returns 'X', 'O', 'Draw', or None (game still in progress). It works by creating a tuple of tuples (array of arrays) `lines`. Each element of `lines` is an array of length 3 containing the pieces (X, Y, empty) along one possible line where a player could get three pieces in a row. Once this tuple of tuples has been created, it loops over `lines` to check whether any of these contain the same value (that isn't ' ', which denotes empty). If there are now winners yet, it checks whether there are any empty squares. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def check_winner(board):\n", "    # Load lines with all of the sequences on the board where a player could get three in a row\n", "    lines = (\n", "        # Horizontal lines\n", "        board[0], board[1], board[2],\n", "        # Vertical lines\n", "        [board[0][0], board[1][0], board[2][0]], \n", "        [board[0][1], board[1][1], board[2][1]], \n", "        [board[0][2], board[1][2], board[2][2]],\n", "        # Diagonals\n", "        [board[0][0], board[1][1], board[2][2]], \n", "        [board[2][0], board[1][1], board[0][2]]\n", "    )\n", "\n", "    # Loop over all of these lines\n", "    for line in lines:\n", "        # Check whether all three elements in this line are the same, and not empty\n", "        if line[0] == line[1] == line[2] and line[0] != ' ':\n", "            # There is a winner! Return the winner ('X' or 'O')\n", "            return line[0]\n", "    \n", "    # Check for a draw (no empty squares)\n", "    if all(all(cell != ' ' for cell in row) for row in board):\n", "        return 'Draw'\n", "    \n", "    # The game isn't over yet!\n", "    return None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's write functions:\n", "- `valid_moves`: Returns a list of all empty spaces where players can make a move. Each element of this list is an $(i,j)$ tuple.\n", "- `make_move`: Updates the board with the player's move if the chosen position is empty. It returns `True` on success and `False` on failure (we will use this for error checking).\n", "- `initialize_board`: Generates a $3 \\times 3$ tic-tac-toe board initialized with empty spaces." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def valid_moves(board):\n", "    return [(i, j) for i in range(3) for j in range(3) if board[i][j] == ' ']\n", "\n", "def make_move(board, move, player):\n", "    i, j = move\n", "    if board[i][j] == ' ':\n", "        board[i][j] = player\n", "        return True\n", "    return False\n", "\n", "def initialize_board():\n", "    return [[' ' for _ in range(3)] for _ in range(3)]" ] },
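{ "cell_type": "markdown", "metadata": {}, "source": [ "The following cell is an *optional* sanity check of the functions above (it is not part of the assignment): it creates an empty board, makes two moves, prints the board, and confirms that `check_winner` returns `None` because the game is still in progress." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check of the helper functions above (not part of the assignment)\n", "board = initialize_board()\n", "make_move(board, (0, 0), 'X')  # X takes the top-left square\n", "make_move(board, (1, 1), 'O')  # O takes the center square\n", "print_board(board)\n", "print(\"Remaining valid moves:\", len(valid_moves(board)))\n", "print(\"check_winner returns:\", check_winner(board))  # None, since the game is still in progress" ] },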
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Implementing Players\n", "\n", "We will create a class for \"players\". One player will be MENACE, and another will select moves randomly. Below is the implementation of the base class. It has three functions:\n", "\n", "- `get_move`: Presents the player with a board and asks for a move. `symbol` is an additional argument saying which symbol, X or O, the player is playing as.\n", "- `learn`: This is called at the end of a game and is given the game's result. The player can use this to change its policy. Like `get_move`, this function also takes `symbol` as an argument, telling the player which symbol it was playing as.\n", "- `reset`: This is used to reset the agent so that multiple trials can be run." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class Player:\n", "    def get_move(self, board, symbol):\n", "        raise NotImplementedError(\"This method should be overridden by subclasses\")\n", "    \n", "    def learn(self, result, symbol):\n", "        raise NotImplementedError(\"This method should be overridden by subclasses\")\n", "    \n", "    def reset(self):\n", "        raise NotImplementedError(\"This method should be overridden by subclasses\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the `RandomPlayer` class, which always selects a move uniformly at random from the legal moves." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class RandomPlayer(Player):\n", "    def get_move(self, board, symbol):\n", "        moves = valid_moves(board)\n", "        return random.choice(moves)\n", "    \n", "    def learn(self, result, symbol):\n", "        pass # Random player does not learn\n", "    \n", "    def reset(self):\n", "        pass # No state to reset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: Playing a Game\n", "\n", "Next we provide code to play a game. This function takes as input two players (e.g., instances of the `RandomPlayer` class that we wrote), and has the players play one game. It returns the winner, with the first player being 'X' and the second player being 'O'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def play_game(player1, player2):\n", "    # Start with a blank initial board\n", "    board = initialize_board()\n", "\n", "    # Store the current and next player. These will be swapped after each turn\n", "    current_player, next_player = player1, player2\n", "\n", "    # Also store the current and next symbol (also swapped after each turn).\n", "    current_symbol, next_symbol = 'X', 'O'\n", "    \n", "    # Loop over time (moves)\n", "    while True:\n", "        # Query the current player for a move\n", "        move = current_player.get_move(board, current_symbol)\n", "\n", "        # Make the move\n", "        if not make_move(board, move, current_symbol):\n", "            # Handle illegal or failed moves. Print something to help with debugging this case!\n", "            print(f\"WARNING: Illegal move by {current_symbol}.\")\n", "            current_player.learn('Loss', current_symbol)\n", "            next_player.learn('Win', next_symbol)\n", "            return next_symbol # Opponent wins if illegal move\n", "\n", "        # Check whether the game is over\n", "        winner = check_winner(board)\n", "        if winner:\n", "            # If we get here, the game is over. Call the \"learn\" functions\n", "            if winner == 'Draw':\n", "                player1.learn('Draw', 'X')\n", "                player2.learn('Draw', 'O')\n", "            else:\n", "                player1.learn('Win' if winner == 'X' else 'Loss', 'X')\n", "                player2.learn('Win' if winner == 'O' else 'Loss', 'O')\n", "            return winner\n", "        \n", "        # Swap players and symbols for the next turn\n", "        current_player, next_player = next_player, current_player\n", "        current_symbol, next_symbol = next_symbol, current_symbol" ] },
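{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, *optional* check (not part of the assignment), the cell below plays a single game between two `RandomPlayer` instances and prints the winner ('X', 'O', or 'Draw')." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: play one game between two random players and report the outcome\n", "demo_winner = play_game(RandomPlayer(), RandomPlayer())\n", "print(\"Result of one random game:\", demo_winner)" ] },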
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5: Running an Experiment\n", "\n", "Next we provide code to have two players play `games_per_trial` games against each other. Right now the players don't learn, but when we implement MENACE we expect it to learn during these games. However, playing this sequence of games might give different results each time we run it. To capture this, we repeat this process `num_trials` times, storing the results.\n", "\n", "That is, the players play `num_trials` sequences of `games_per_trial` games against each other, and the outcomes of all games are stored.\n", "\n", "The results that are returned are *rewards* from the perspective of `player1`. The rewards are (again, from `player1`'s perspective):\n", "- Win: +3\n", "- Loss: -1\n", "- Draw: +1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_trials(num_trials, games_per_trial, player1, player2):\n", "    # Create a matrix where we will store the results. It has num_trials rows and games_per_trial cols\n", "    results_matrix = np.zeros((num_trials, games_per_trial))\n", "\n", "    # Loop over the trials\n", "    for trial in range(num_trials):\n", "        # Reset the players so they learn from scratch\n", "        player1.reset()\n", "        player2.reset()\n", "\n", "        # Loop over games in this trial\n", "        for game in range(games_per_trial):\n", "            # Have player1 go first for even games and player2 go first for odd games\n", "            if game % 2 == 0:\n", "                result = play_game(player1, player2)\n", "                reward = 3 if result == 'X' else 1 if result == 'Draw' else -1\n", "            else:\n", "                result = play_game(player2, player1)\n", "                reward = 3 if result == 'O' else 1 if result == 'Draw' else -1\n", "            \n", "            # Log the result as a reward value\n", "            results_matrix[trial, game] = reward\n", "    \n", "    # Return the computed results\n", "    return results_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 6: Plotting the Results\n", "\n", "Next, we provide code to plot the results. This will be a plot with the number of games played on the horizontal axis and the resulting reward on the vertical axis. We will average the results from the `num_trials` trials that are run, and will include a shaded region showing the standard error.\n", "\n", "**Note**: To reduce the number of points plotted, this plots a point every 100 games. Each point reports the average reward over the last 100 games.\n", "\n", "**Note**: We also include a red dashed line at 1.0, which corresponds roughly to the value achieved when neither agent learns. MENACE should be able to improve upon this line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_learning_curve(results_matrix):\n", "    # Calculate the mean rewards and standard errors across trials for each game\n", "    mean_rewards = np.mean(results_matrix, axis=0)\n", "    standard_errors = np.std(results_matrix, axis=0) / np.sqrt(results_matrix.shape[0])\n", "    games = np.arange(1, results_matrix.shape[1] + 1)\n", "\n", "    # Averaging over intervals of 100 games\n", "    interval = 100\n", "    num_intervals = (len(mean_rewards) + interval - 1) // interval\n", "    averaged_means = []\n", "    averaged_errors = []\n", "    averaged_games = []\n", "\n", "    for i in range(num_intervals):\n", "        start_index = i * interval\n", "        end_index = min(start_index + interval, len(mean_rewards))\n", "        \n", "        # Calculate the mean of means and the mean of errors for the interval\n", "        interval_means = mean_rewards[start_index:end_index]\n", "        interval_errors = standard_errors[start_index:end_index]\n", "\n", "        averaged_means.append(np.mean(interval_means))\n", "        averaged_errors.append(np.mean(interval_errors))\n", "        averaged_games.append(np.mean(games[start_index:end_index]))\n", "\n", "    # Create the plot\n", "    plt.figure(figsize=(10, 6))\n", "    plt.plot(averaged_games, averaged_means, '-', label='Average Reward', color='blue')\n", "    plt.fill_between(averaged_games, np.array(averaged_means) - np.array(averaged_errors), np.array(averaged_means) + np.array(averaged_errors), color='lightblue', alpha=0.5)\n", "    plt.title('Learning Curve over Multiple Trials')\n", "    plt.xlabel('Game Number')\n", "    plt.ylabel('Average Reward')\n", "    plt.axhline(y=1, color='red', linestyle='--', label='No Learning Benchmark')\n", "    plt.ylim(-1.5, 3.5)\n", "    plt.legend()\n", "    plt.grid(True)\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 7: Example Run\n", "\n", "Here is how we can run this code to obtain a **learning curve** for two random players playing against each other:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_trials = 100\n", "games_per_trial = 1000\n", "player1 = RandomPlayer()\n", "player2 = RandomPlayer()\n", "results = run_trials(num_trials, games_per_trial, player1, player2)\n", "plot_learning_curve(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 8: Writing MENACE\n", "\n", "Finally, we're ready to implement MENACE! We will create a MenacePlayer class. The constructor takes an argument, `initial_beads`, that specifies how many beads (for each move) are placed in the matchboxes at the start. For this assignment, your task is to implement MENACE. Fill in the missing code below.\n", "\n", "**NOTE:**\n", "You do **not** need to handle symmetric boards. The real MENACE method treats two boards that are equivalent under symmetry as the same matchbox. For example, a board with only an X in the top-left and a board with only an X in the top-right are effectively the same, and a single matchbox could be used to handle both of these game states. Without accounting for these symmetric game states, there are roughly 20,000 possible boards in tic-tac-toe. Accounting for symmetries, the number of possible states can be reduced to just 304! Because our implementation does not account for these symmetries, it must effectively learn the right behavior for the \"same\" position many times independently. This will make our implementation much slower to learn than the real MENACE.\n", "\n", "**NOTE:**\n", "You are encouraged to implement the variant of the MENACE algorithm described in class, although you are welcome to implement the full algorithm described in the paper \"Experiments on the mechanization of game-learning Part I. Characterization of the model and its parameters\" by Donald Michie.\n", "\n", "Although your function specifications must remain unchanged, you are welcome to implement each function in any way you would like. We have provided one additional function that you may consider using (you do not need to use it and can delete it): `matchbox_key(self, board)`. This function maps a board to an integer that can be used as the key in a dictionary. Note that not all board states are possible, so this function doesn't map possible board states to consecutive integers. Rather, it ensures that a different integer is given for each possible board state." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class MenacePlayer:\n", "    def __init__(self, initial_beads):\n", "        self.initial_beads = initial_beads # The number of beads of each color to add at the start of each trial (here, and when reset is called)\n", "        ### TODO: ENTER YOUR CODE HERE\n", "\n", "    def reset(self):\n", "        ### TODO: ENTER YOUR CODE HERE. This function should completely reset the MenacePlayer so that it is as though it was a fresh new player\n", "        pass\n", "\n", "    def get_move(self, board, symbol):\n", "        ### TODO: ENTER YOUR CODE HERE. Note that you'll need to store the boards and moves until the learn function is called at the end of the game\n",
"        ### NOTE: Our solution does not reference `symbol`.\n", "        ### NOTE: The code below is for selecting a random move and should be replaced\n", "        moves = valid_moves(board)\n", "        return random.choice(moves)\n", "\n", "    def learn(self, result, symbol):\n", "        ### TODO: ENTER YOUR CODE HERE.\n", "        ### NOTE: Our solution does not reference `symbol`.\n", "        ### NOTE: To avoid errors, either do not allow the last bead to ever be removed from a matchbox or ensure that having an empty matchbox is properly handled\n", "        ### NOTE: If you're storing a log of the game in get_move calls and using it here, don't forget to clear that log at the end of this function!\n", "        pass\n", "\n", "    ### TODO: YOU MAY DEFINE ADDITIONAL HELPER FUNCTIONS HERE.\n", "    def matchbox_key(self, board):\n", "        key = 0\n", "        multiplier = 1\n", "        for row in board:\n", "            for cell in row:\n", "                value = {' ': 0, 'X': 1, 'O': 2}.get(cell, 0)\n", "                key += value * multiplier\n", "                multiplier *= 3\n", "        return key\n", "    " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 9: Results!\n", "\n", "Let's see how MENACE does against the random opponent! Run the code below with your complete MENACE implementation to produce the resulting learning curve. When working on your code and debugging, we recommend reducing the number of trials so that this runs faster. For your final submission, ensure that `num_trials = 100` and the resulting plot is included! On my desktop, my solution took 4 minutes to generate this final plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_trials = 100\n", "games_per_trial = 50000\n", "player1 = MenacePlayer(3)\n", "player2 = RandomPlayer()\n", "results = run_trials(num_trials, games_per_trial, player1, player2)\n", "plot_learning_curve(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 10: Optional\n", "\n", "This section is not for credit and will not be evaluated. However, there are a few ways that you could modify your program for fun:\n", "\n", "1. Write your own fixed strategy as a new Player object, and see how MENACE fares against this fixed strategy. To make this easier, we included the `symbol` argument in `get_move`. This *can* be inferred from the game state, and so it's not necessary, but it simplifies this process.\n", "2. Modify your code to account for symmetric board positions, and see how much faster MENACE learns!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }